Handle errors from cluster_status #3735

Draft
wants to merge 11 commits into
base: main
from


clumens
Contributor

@clumens clumens commented Nov 15, 2024

I'm a little undecided on this patch at the moment - check out the changes to regression test output in the last patch. I think some of that could be mitigated by not printing the warnings at all. The errors are a little trickier. We want to print them out if we're not verbose (that's the entire point of the related issue) but that means we get all of them.

Contributor

@kgaillot kgaillot left a comment

Only had time for first commit so far

lib/pacemaker/pcmk_scheduler.c (resolved)
Use pcmk_unpack_scheduler_input instead.
… fails.

This function can return a couple error codes, most notably when called
on input with a feature set that is newer than the latest supported.  In
that case, the caller should return its own error instead of trying to
continue on with an unpopulated scheduler object.  This prevents a
cascade of error messages.
Also, there's no need to do any error reporting.  pcmk__config_err will
have already called crm_err in this case.
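
For illustration only, here is a minimal sketch of the pattern these two commit messages describe: the caller checks the return code from the unpacking step and bails out right away, without adding its own error report. All names below (`fake_scheduler_t`, `fake_cluster_status`, `fake_simulate`, the `RC_*` codes) are hypothetical stand-ins rather than Pacemaker API, and feature sets are reduced to plain integers to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes; NOT the real Pacemaker values */
enum { RC_OK = 0, RC_SCHEMA_TOO_NEW = 1 };

/* Hypothetical stand-in for the scheduler object */
typedef struct {
    bool populated;     /* set once the input has been unpacked */
} fake_scheduler_t;

/* Stand-in for an unpacking step that fails on too-new input.
 * Assumption: feature sets are modeled as integers for brevity. */
int
fake_cluster_status(fake_scheduler_t *scheduler, int input_features,
                    int supported_features)
{
    assert(scheduler != NULL);

    if (input_features > supported_features) {
        /* The scheduler is left unpopulated; the real code would have
         * logged the problem at this point. */
        return RC_SCHEMA_TOO_NEW;
    }
    scheduler->populated = true;
    return RC_OK;
}

/* Caller: stop at the first failure instead of continuing with an
 * unpopulated object, which is what triggers the cascade of errors. */
int
fake_simulate(int input_features, int supported_features)
{
    fake_scheduler_t scheduler = { .populated = false };
    int rc = fake_cluster_status(&scheduler, input_features,
                                 supported_features);

    if (rc != RC_OK) {
        return rc;      /* no extra reporting; the callee already logged */
    }

    /* ... continue working with the populated scheduler ... */
    return RC_OK;
}
```

With these stand-ins, `fake_simulate(3, 2)` returns `RC_SCHEMA_TOO_NEW` without touching the scheduler any further, while `fake_simulate(1, 2)` proceeds normally.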
The error message is hidden and only gets displayed if -V is given on
the command line.  Adding config error/warning handlers will cause the
error to be displayed regardless.

This could have been implemented in a couple of ways, and there are tradeoffs
here.  I've chosen to duplicate what's happening in crm_verify, but
instead of checking for verbosity (which is a global variable in that
file), I'm checking out->is_quiet.

This means that if you do `crm_simulate -Q`, you won't see the error
message but you will get an error return code.  This also means that with
`crm_simulate -Q -VVVV...`, you still won't see the error message.  This
may be a bug, but I'm not sure who would do that and I also think these
sorts of problems are pervasive in our command line tools.

Fix T521
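
As a rough sketch of the behavior described in this commit message (not the actual patch), a config error handler that consults only the output object's quiet flag could look like the following. The `fake_output_t` type, `fake_is_quiet`, and `fake_config_err_handler` are hypothetical names for illustration; the real output object has many more members, and the snippet under review uses `vasprintf`, which is swapped for `vsnprintf` here to stay within standard C.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical output object; only what the sketch needs */
typedef struct fake_output_s {
    bool quiet;                 /* stand-in for the effect of -Q */
    int errors_shown;           /* counts messages actually emitted */
    bool (*is_quiet)(struct fake_output_s *out);
} fake_output_t;

bool
fake_is_quiet(fake_output_t *out)
{
    return out->quiet;
}

/* Format the message, then show it unless the output is quiet.
 * Verbosity is never consulted, so -Q wins even when -V is given. */
void
fake_config_err_handler(fake_output_t *out, const char *msg, ...)
{
    va_list ap;
    char buf[256];
    int len;

    va_start(ap, msg);
    len = vsnprintf(buf, sizeof(buf), msg, ap);
    va_end(ap);
    assert(len >= 0);

    if (!out->is_quiet(out)) {
        fprintf(stderr, "error: %s\n", buf);
        out->errors_shown++;
    }
}
```

Because the handler never looks at a verbosity level, this sketch reproduces the `-Q -VVVV...` behavior noted above: the message stays hidden, and only the return code tells the caller that something went wrong.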
This is just like the previous patch for crm_simulate, complete with all
the same problems regarding -Q and -V.
This is just like the previous patch to crm_simulate.  However, one
additional problem here is that it relies on using the deprecated -Q
command line option.  On the other hand, I think this is okay because we
have a lot of work to do straightening out these sorts of options for
all our command line tools.  This is just one more thing we'll have to
deal with at that time.
This takes care of all callers of pcmk__output_cluster_status and
pcmk__status.  pcmk_status would also be affected, but at the moment
there are no users of that function and anyway the config error handlers
aren't public API.
The point of this is to allow it to return the value from unpack_cib,
which is returning the value from cluster_status.  This allows us to
check whether that function hit the too-new feature set CIB condition.
…tions.

This takes care of most callers - the ones in the daemons are unlikely
to be a problem.  This allows catching the too-new schema condition in
various other tools and displaying an error message to the user.

Note that a couple other callers don't need to check the return value.
I've added comments explaining why.
* Remove the leading function name from various messages.  This was most
  commonly "unpack_resources".

* In the XML output format, move various messages from text that gets
  printed out to the XML output itself.  This does end up with somewhat
  weird output with status="0" message="OK" followed by some error
  messages.

* Add a couple warnings to crm_resource output.
@clumens clumens force-pushed the cluster_status-errors branch from 07e8301 to b4f4b75 Compare November 19, 2024 17:18
Contributor

@kgaillot kgaillot left a comment

I'm leaning toward keeping the messages in crm_simulate output; they are issues that the user needs to know about

lib/pengine/status.c (resolved)
include/crm/pengine/status_compat.h (resolved)
lib/pacemaker/pcmk_simulate.c (resolved)
lib/pengine/status.c (resolved)

va_start(ap, msg);
pcmk__assert(vasprintf(&buf, msg, ap) > 0);
if (!out->is_quiet(out)) {
Contributor

might as well just un-deprecate -Q

@@ -1730,6 +1757,7 @@ WARNING: Creating rsc_location constraint 'cli-ban-dummy-on-node1' with a score
=#=#=#= End test: Move a resource from its existing location - OK (0) =#=#=#=
* Passed: crm_resource - Move a resource from its existing location
=#=#=#= Begin test: Clear out constraints generated by --move =#=#=#=
warning: More than one node entry has name 'node1'
Contributor

Is this message accurate?

Contributor Author

Doesn't look like it, though I don't yet see what could be causing that.

cts/scheduler/summary/bug-lf-1852.summary (resolved)
@@ -1,3 +1,6 @@
error: Ignoring invalid node_state entry without id
warning: Ignoring failure timeout (10s) for rsc_pcmk-2 because it conflicts with on-fail=block
warning: Ignoring failure timeout (10s) for rsc_pcmk-4 because it conflicts with on-fail=block
Contributor

Most of the errors/warnings in the test cases are things we really should fix in the test case :(

(some may be testing the error though)

2 participants